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Abstract: Random graph mixture models are now very popular for mod- 
eling real data networks. In these setups, parameter estimation proce- 
dures usually rely on variational approximations, either combined with the 
expectation-maximisation (em) algorithm or with Bayesian approaches. De- 
spite good results on synthetic data, the validity of the variational approxi- 
mation is however not established. Moreover, the behavior of the maximum 
likelihood or of the maximum a posteriori estimators approximated by these 
procedures is not known in these models, due to the dependency structure 
on the variables. In this work, we show that in many different affiliation 
contexts (for binary or weighted graphs), estimators based either on mo- 
ment equations or on the maximization of some composite likelihood are 
strongly consistent and y'n-convergent, where n is the number of nodes. As 
a consequence, our result establishes that the overall structure of an affilia- 
tion model can be caught by the description of the network in terms of its 
number of triads (order 3 structures) and edges (order 2 structures). We 
illustrate the efficiency of our method on simulated data and compare its 
performances with other existing procedures. A data set of cross-citations 
among economics journals is also analyzed. 

Keywords and phrases: composite likelihood, random graph, mixture 
model, stochastic blockmodel. 



1. Introduction 

The analysis of network data appears in different scientific fields, such as social 
sciences, communication networks and many others, including a recent explo- 
sion in the field of molecular biology (with the study of metabolic networks, 
transcriptional regulatory networks and proteins interactions networks). The 
literature is vast, and we refer for instance to Boccaletti ct al. [2006], Goldcn- 
berg et al. [2010] and the book by Kolaczyk [2009] for interesting introductions 
to networks. 
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Erdos and Rcnyi [1959] introduced one of the earliest and most studied ran- 
dom graph model, in which binary random graphs are considered as a set of 
independent and identically distributed (i.i.d.) Bernoulli edge variables over a 
fixed set of nodes. This model is however too homogeneous to capture some im- 
portant features of real networks, such as the presence of 'hubs', namely highly 
connected nodes. This lack of heterogeneity led to the introduction of mixture 
versions of the simple Erdos-Renyi model. So-called 'stochastic blockmodels' 
[Daudin et al., 2008, Frank and Harary, 1982, Holland ct al., 1983, Snijders and 
Nowicki, 1997] were introduced in various forms, primarily in social sciences 
to study relational data. In this context, the nodes are partitioned into latent 
groups (blocks) characterizing the relations between nodes. Blockmodeling thus 
refers to the particular structure of the adjacency matrix of the graph (i.e. the 
matrix containing the edges indicators). By reordering the nodes with respect 
to the groups they belong to, this matrix exhibits blocks. Diagonal and off- 
diagonal blocks respectively represent intra-group and inter-group connections. 
In case where blocks exhibit the same behaviour within their type (diagonal or 
off-diagonal), we further obtain what we call an affiliation structure. Affiliation 
structures are parsimonious in the number of parameters they use and may 
model a lot of situations. For instance, affiliation models encompass both com- 
munity structures and disassortative mixing [Newman and Lcicht, 2007]. In the 
first case (community structure) the intra-group connectivities are high while 
the inter-group connectivities are low. Disassortative mixing rather corresponds 
to high inter-group connectivities and low intra-group connectivities. 

Many networks are or can be weighted (or in other words valued). Those 
weights are precious additional information on the graph and should be taken 
into account in their analysis. Well-known examples of weighted networks in- 
clude airline traffic data between airports, co-authorship networks of scientists 
[Barrat et al., 2004] or when rather considering the corresponding adjacency 
matrix, financial correlation matrices [Laloux ct al., 1999]. While the two first 
examples correspond to sparse weighted networks, the last one concerns dense 
(or complete) weighted graphs. Weighted networks are a way of integrating het- 
erogeneous data and their analysis is thus of primary importance [Newman, 
2004]. Community detection (i.e. the problem of finding clusters of nodes with 
many edges joining vertices of the same cluster and comparatively few edges 
joining vertices of different clusters) has been widely considered in the context 
of weighted graphs [see for instance Fortunato, 2010]. While community detec- 
tion methods are mainly algorithmic, another approach is to rely on generative 
models and random graphs mixtures. Stochastic blockmodels for analyzing ran- 
dom graphs with non binary relations between nodes have been considered either 
in the case of a finite number of possible relations [Nowicki and Snijders, 2001] 
or for more general weighted graphs [Alariadassou ct al., 2010]. Our approach 
builds on these latter references. We also point out the existence of general- 
ized blockmodels for valued networks [Ziberna, 2007, Doreian et al., 2005] which 
however do not rely on a probabilistic model as we shall do here. 

In this article, we will be interested in both binary and weighted random 
graphs and will focus on mixture models. We mention the existence of an in- 
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creasing literature on two different related concepts: mixed membership [Airoldi 
et al., 2008, Erosheva et al., 2004] and overlapping [Latouclie et al., 2011a] 
stochastic blockmodels for binary networks, in which nodes may belong to sev- 
eral classes. However these models are beyond the scope of the present work. 

Current estimation procedures in random graph mixture models rely on ap- 
proximations of the likelihood, which is itself intractable due to the presence 
of the non observed groups. Either expectation maximization (em) algorithm 
[Dempster et al., 1977], or Bayesian approaches are at the core of these strate- 
gies. Both rely on the computation of the distribution of the hidden nodes states, 
conditional on the observed edges variables. However, in the particular case of 
random graph mixtures, the exact computation of this conditional distribution 
can not be obtained, due to its non-factorized form. Thus, approximate compu- 
tations are made, leading to what is called 'variational' EM or Bayes strategies 
[Daudin et al., 2008, Latouchi' ci. al., 20111), Picard et al., 2009, Zanghi et al., 
2008]. The major drawback of these methods is their relatively large computa- 
tional time. Besides, even if these methods exhibit good behavior on simulated 
data, they suffer from a lack of theoretical support. Indeed, two major features 
of these procedures still lack understanding. First, the quality of the variational 
approximation is not known, and this approximation may even prevent con- 
vergence to local maxima of the likelihood [Gunawardaiia and Byrne, 2005]. 
Second, the consistency of the maximum likelihood or of the maximum a pos- 
teriori estimators is still an open question in these models. 

Here, we propose simple strategies for estimating the parameters of mixture 
random graph models, in the particular afhliation case. The methods not only 
rely on established convergence results, but are also simpler than variational 
approaches. By focusing on small structures (edges and triads) and treating 
these as if they were (but never assuming they are) independent, we prove that 
we may recover the main features of an affiliation model. We adopt strategies 
based on either solving moment equations or maximizing a composite marginal 
likelihood. A composite marginal likelihood consists in the product of marginal 
distributions and may replace the likelihood in models with some dependency 
structure [see for instance Cox and Reid, 2004, Variii, 2008]. In the weighted 
random graphs case, our result shows that parameters may be estimated rely- 
ing on a composite likelihood of univariate marginals. This is not the case for 
binary random graphs, because parameters of mixtures of univariate Bernoulli 
distributions are not identifiable. However, parameters of mixtures of 3-variate 
Bernoulli are identifiable [see AUman et al., 2009, Corollary 5]. Thus, in the bi- 
nary random graph case, we develop moment or composite likelihood methods 
based on the marginals of triads, namely the 3 random variables (AT^, Xik, Xjk) 
induced by a set of 3 nodes (i, j, k). 

Once the convergence of our estimators, let us say 6'„ to 6, has been estab- 
lished, the next question of interest concerns the order at which the discrepancy 
9,1 — 9 converges to zero. We establish asymptotic normality results, thus ob- 
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taining rates of convergence of our procedures. This is in sharp contrast with 
existing methods and the first insight on the difficuh issue of exhibiting (op- 
timal) rates of convergence for estimation procedures in these random graphs 
models. Indeed, a still open problem may be stated as follows: what is the para- 
metric rate of convergence when observing n(n— 1)/2 (non independent) random 
variables over a set of n nodes, distributed according to a random graph model? 
Is it ^1 \pn or 1/n? In other words, the issue is whether the observation of these 
potentially n(n — l)/2 dependent edges variables over a set of n nodes enables 
existence of estimation procedures with rates of convergence of the order 1 jn or 
rather 1/^/n. We obtain here theoretical results with rates of convergence of the 
order at least l/\Ai (which might not be optimal). Moreover, in the degenerate 
case where the group proportions are equal, the rates of convergence increase 
to 1/n. Our simulations seem also to indicate rates of convergence faster than 
1 /y/n, that might be due to degeneracies in the limiting variances of our central 
limit results, i.e. the fact that these variances might be zero. 

The paper is organized as follows. In Section 2, we state the different nota- 
tions, present the general assumptions of our model as well as the main result: 
a law of large numbers and a central limit theorem for normalized sums of func- 
tions of variables over a fc-tuple of nodes. Section 3 focuses on binary random 
graphs: after introducing the specific model for binary variables, we present 
two different estimation procedures. The first one (Section 3.1) relies on mo- 
ment equations and assumes that the group proportions tt are known, while 
the second one (Section 3.2) is more general and relies on composite likelihood. 
Section 4 presents the weighted random graph model as well as the parameter 
estimation procedure, relying also on a composite likelihood approach. While a 
first part of our work focuses on theoretical results about consistency of the pro- 
cedures, a second part is dedicated to algorithmic issues as well as experiments. 
In Section 5, we present the implementation of the estimation procedures. A 
particular attention is paid to the problem of unraveling the latent structure 
of the model (Section 5.2). In Section 6, the performances of our procedures 
are illustrated on synthetic data and we also provide the analysis of a real data 
example. Finally, all the proofs are postponed to Section 7. 

2. Model and main result 

Let us first give some notations that will be useful throughout this article. For 
any Q > 1, let Sq denote the simplex {(tti, . . . , ttq); tt^ > 0; Yl'i=i ^ 1} and 
Vq — {{vi, . . . , vq), Vi G {0, 1}, Y^f^i Vi — 1}. For the sake of simplicity, we only 
consider in the following undirected graphs with no self-loops. Easy generaliza- 
tions may be done to handle directed graphs, with or without self-loops. 

In this section, we define a general mixture model of random graphs in the 
following way. First, let {2i}i<i<n be i.i.d. vectors Zi = {Zn, . . . , Ziq) € Vq, 
following a multinomial distribution A^(l,7r), where tt = (tti, . . . , ttq) e Sq. 
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Random variable indicates to which group (among Q possibihties) node i 
belongs. These random variables are used to introduce heterogeneity in the 
random graph model. 

Next, the observations {Xij}i<i<j<n are indexed by the node pairs and 
take values in a general normed vector space X (in the next sections, X = {0, 1} 
or N or K'). We then assume that conditional on the latent classes {.^i}i<i<n, 
the random variables {Xij}i<ci<j<n are independent. Moreover, the conditional 
distribution of Xij depends only on Zi, Zj and has finite variance. The model 
may thus be summarized in the following way 



General model ■ 



{.Zi}i<j<„ i.i.d. vectors in Vg, with distribution A^(l,7r) 
{Xjj}i<i<j<„ observations in X, 

l<i<.7i) — ^l<z<j<n 

¥{X,,\Z,,Z,), 

E(||X,j||2|Z„Z,) < 

(1) 

It may be worth noting that the variables {^ij}i<i<j<ri are not independent 
in general, but we often make use of the fact that sets of non adjacent edges 
induce independent random variables. More precisely, if /, J C {1, . . . , n} with 
/n J = 0, then }(,jj)g/2 and {Xij}^ij-j^j2 are independent. 

In the next sections, we will focus on the particular affiliation mixture model, 
where the conditional distribution of an edge variable Xij only depends on 
whether the endpoints i,j belong to the same group (i.e. Z,; = Zj). We shall 
thus refer to the assumption 

Affiliation structure : P{Xij\Zi, Zj) = f{X,j\lz,=z,), (2) 

where 1a is the indicator function of the set A. 

Moreover, in the particular case of equal group proportions and affiliation 
structure, we shall observe some degeneracy phenomenas. These are due to the 
fact that the distribution becomes invariant under permutation of the specific 
values of the node groups (see Lemma 1 in Section 7 for more details) . For later 
use, we thus also introduce the equal group proportions setting 

Equal group proportions case : -Kq ~ 1/ Q for any q e {1, . . . , Q}. (3) 

Let us now motivate the following developments. Under the affiliation struc- 
ture assumption, the distribution of a single edge follows a two-components 
mixture of the form 

X,j ^ -1nX^J\Z, = Z,) + (1 - 7)P(Xy|Z, ^ Z,). 

For weighted random graphs, we shall assume a parametric form for this abso- 
lutely continuous conditional distribution, namely ¥{Xij\Zi — Zj) = ¥e^^{Xij) 
and ¥{Xij\Zi ^ Zj) = ¥g^^^{Xij). The vast majority of families of parametric 
absolutely continuous distributions give finite mixtures whose parameters are 
identifiable. This is equivalent to saying that E[log(7Peij^(Xi2)-l-(l— 7)Pe„„t (-'^'12))] 
has a unique maximum at the true parameter value (0in,^out)- This reasoning 
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is at the core of maximum hkeUhood estimation and motivates the introduction 
of a composite log-Hkehhood 

CT^^iO)- E log(7P..„(^.,) + (l-7)P.o„.(^.,)), 

l<i<j<n 

which is not the model likelihood as the random variables Xij are not indepen- 
dent. Its usefulness to estimate the parameters relies on whether the renormal- 
ized criterion £^™''°(0) / (n(n— 1)) converges to the expectation E[log(7Pej^ (-''^12)+ 
(1 — 7)Pe^^j (X12))]. We shall prove below that the answer is yes and thus, max- 
imizing with respect to 6* is a good strategy. 

In the binary random graph case however, the strategy has to be modified 
because each random variable Xij follows a mixture of univariate Bernoulli 
distributions whose parameters are not identifiable. We thus rather consider 
mixtures of 3-variate Bernoulli distributions which appear to be sufficient to 
consistently estimate the parameters. 

Thus, we are now interested more generally in the behavior of empirical 
sums of functions of the random variables induced by a fc-tuple of nodes. These 
empirical estimators are at the core of the estimation procedures that we shall 
later consider. To this aim, let us introduce some more notations. 

Define the set of nodes I = {1, . . . ,n} and the set of k distinct nodes Ik = 
{{ii, . . . ,ik) & ^ ii for any j ^ I}. (X^ is also the set of injective maps 

from {1, . . . , A;} to I = {1, . . . , n}). For any fixed integer fc > 1, and any fc-tuple 
of nodes i = (ii, . . . ,ife) € Ik, we let Xi = {Xi^i^, . . . , Xi^i^, Xi^,.^, . . .,Xi^_^iJ 
be the vector of p = (2) random variables induced by the fc-tuple of nodes i. 
Moreover, for any s > 1 and any measurable function g : — K*, we let 

= ^^^^ E and = E(5(X(i-^))). 

The next theorem establishes a strong law of large numbers as well as asymp- 
totic normality of the estimator rhg. As the random variables {Xij} are not 
independent, consistency (as well as asymptotic normality) of this empirical 
estimator is not trivial and has to be established carefully. 

Theorem 1. Under the assumptions of model (1) , for any fc, s > 1 and p = (2) 
and any measurable function g : — > such that E{\\g(K^^'''''''^)\\^) is finite, 
the estimator rhg is consistent 

rhg — rn„ almost surely, 

n— >oo 

as well as asymptotically normal ^^/n{rhg — mg) '^n^co A/'(0, Sg). If we moreover 
assume an affiliation structure (2) with equal group proportions (3) , then Y.g — 
and n{rhg — nrig) converges in distribution as n tends to infinity. 

Let us now give some comments about the previous result. First, an expres- 
sion for the limiting distribution Eg is given in the proof of the theorem. Such 
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an expression is useful for instance in the construction of confidence intervals. 
However, although our estimators of the model parameters are derived from 
estimators of the form rhg , we did not obtain here simple expressions for their 
limiting variance from an expression of T,g. Thus, rather than the exact form of 
the limiting distribution, we are more interested here in rates of convergence. 

The theorem states that the convergence of rhg to rUg happens with a rate 
at least 1/y/n. In the case where we consider an affiliation structure with equal 
group proportions, we prove that the limiting variance is null {i.e. T,g = 0), 
meaning that ^Jn{rhg — nig) converges in probability to zero. We then further 
prove that the sequence n(jhg — nig) converges in distribution (to some non 
Gaussian limit). Thus in this degenerate case, the convergence of rhg happens 
at the faster rate 1/n. 

We shall see that consistency as well as rates of convergence are preserved in 
the estimation procedures that we deduce from moment estimators of the form 
rhg. To our knowledge, this work is the first one giving some insights about con- 
sistency and rates of convergence of estimation procedures in random graphs 
mixtures models. 

In the next sections, we consider two particular instances of the mixture 
model defined in (1): the binary affiliation model (Section 3) and the weighted 
affiliation model (Section 4). 

3. Binary affiliation model 

In the case of binary random graphs, we observe binary random variables 
{Xij}i<i<j<„ indicating presence (1) or absence (0) of an edge between nodes 
i and j. The latent classes {.^i}i<i<n are still distributed as i.i.d. multinomial 
vectors on Vg. Conditional on these latent classes {^i}i<i<n, we assume that 
{Xij}i<i<j<„ are independent Bernoulli B{-) random variables, with parame- 
ters depending on the node groups. More precisely, we restrict our attention to 
the affiliation structure model (2), where nodes connect differently whether they 
belong to the same group or not. We let 

Vg,£e {!,..., Q}, = ^[j^j i[^ = ^; (4) 

Here, a and /? respectively are the intra-group and the inter-group connectivities 
and we let pqi = alq^e + P'^q^t, for any 1 < (7, £ < Q. In the following, we always 
assume a ^ 13. 

The whole parameter space is given by 

H = {(tt, a, TT e 5q n (0, a e (0, 1), /3 e (0, 1), a ^ /?}. 

We will use the notation b{x,p) = p^{l — (where x € {0, 1} and p G [0, 1]) 
for Bernoulli density with respect to counting measure. Note that in this setup. 
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the complete data log- likelihood simply writes 

n Q 
i=l q=l 

Q 

+ E E^'9^^9{^^jloga + (l-X,,)log(l-a)} 

l<2<j<n q—1 

+ E E Z,gZ,,{X,, log 13 + {l-X,,)log{l- (3)}. (5) 

l<i<j<n \<qi^l<Q 

Figure 1 (left part) displays an example of a binary random graph distributed 
according to this affiliation model. 

3.1. Moment estimators in the binary affiliation model with known 
group proportions 

The following approach based on moment equations was initially proposed by 
Frank and Harary [1982] to estimate the connectivity parameters a and /3 (as 
well as, in some cases, the number of groups Q). The core idea is simple: the 
moment equations corresponding to the distribution of a triplet {Xij,Xik,Xjk) 
give three equations which can be used to estimate the two parameters a and 
/3, as soon as the group proportions (also appearing in these equations) are 
known. However, this method has not been thoroughly checked by Frank and 
Harary and may give rise to multiple solutions. Indeed, these authors never 
discuss uniqueness of the solutions to the system of (non linear) equations they 
consider. This point has been partly discussed in Alhnan ct al. [2011] and the 
estimation procedures proposed here are an echo to the identifiability results 
obtained there. 

The following method applies only when the mixture proportions tt are 
known. We develop in Section 5 an algorithmic procedure that iteratively es- 
timates the group proportions in a first step, and the connectivity parameters 
{a, /3) in a second step. This second step uses the method we shall now describe. 

First, we let S2 — J2q '^g ^-nd S3 = J2q ^q- Then, one easily gets the formulas 

nil '■= E{Xij) = S2a + {1 — S2)(3, 

m2 := E(X,,X,fc) = 53^2 + 2(52 -S3)a/3 + (l~ 2s2 + S3)/?2, (6) 
m3 := E(JC„-X,fcX,fc) = S3a3 + 3(s2-S3)a/32 + (l-3s2 + 2s3)/3^. 

Since any triplet (Xij , Xik, Xjk) takes finitely many states, its distribution is 
completely characterized by a finite number of its moments. In the binary af- 
filiation mixture model context, there are in fact only three different moments 
induced by a triplet distribution. Thus, the previous three moment equations 
completely characterize the distribution of any triplet {Xij , Xn., Xjk). Note that 
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looking at higher order motifs, namely at the distribution of a set of p — (2) 
random variables over a set of k nodes for fc > 4 would provide more equations 
but would also lead to more intricate methods [see for instance Allman ct al., 
2011]. 

In the article by Allman et al. [2011], the possible solutions (with respect to a 
and /?) of this set of moment equations are examined. Their result distinguishes 
the equal group proportions case (tt, = 1/(5,V1 < q < Q) where a degeneracy 
phenomenon takes place. 

Theorem 2. [Allman et al., 2011]. If m2 7^ ml, then the -Kq 's are unequal and 
we can recover the parameters (3 and a via the rational formulas 

r, (s3 - S2S'i)m\ + {si - Ss)m2mi + (S3S2 - sD^s , "^i + (^2 - 1)^ 

P — ~i — 9 \/r^ 3 — N ^ — ■ 

(m^ - m2)(2s^ - 3S3S2 + S3) S2 

(7) 

If = m\ , then the tt^ 's are equal and we have 

, 1/3 

m3 ^ 



/? = TOi + Q-1 j ''"'^ " " ^""1 + " 

As soon as S2 and S3 are known, by plugging estimators of the moments rui 
into these equations, we obtain simple estimates for parameters a and /3. We 
thus first introduce empirical moment estimators w^, defined by 

'^1 = ~r~~~rT ^iJ' rh2 = —, ^Y? ^ X^jXik, 

n(n—l) ^ nin - 1 n - 2) ^ 



1 

n - 2') 

(i.'j,k)ei3 



rh-s — —. -^7 — y XijXikXjk. (9) 



Note that those estimators are all of the form rhg for some specific function 
g. Thus, their consistency is a consequence of Theorem 1. We are then able to 
prove the following result. 

Theorem 3. In the binary affiliation model .specified by (1) and (4), when the 
group proportions tt are supposed to be known, we have the following results. 

i) When the iTq 's are unequal, the estimators (a,/3) defined through (7) where 
the VTLi 's are replaced by the rhi 's, converge almost surely to (a, /3). More- 
over, the rate of this convergence is at least 1 / ^Jn. 

ii) When the tt^'s are equal, the estimators (d,/3) defined through (8) where 
the rUi 's are replaced by the rhi 's, converge almost surely to (a,/3). More- 
over, the rate of this convergence is at least 1/n. 

The performances of this method, combined with an iterative procedure to 
uncover the latent structure and estimate the group proportions are illustrated 
in Section 6. 
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3.2. M -estimators in the binary affiliation model 

We shall now describe another parameter estimation procedure based on M- 
estimators [see for instance van der Vaart, 1998, Chapter 5], i.e. estimators 
maximizing some criterion (here, a composite likelihood). This procedure is 
more direct than the previous moments method developed in Section 3.1, as it 
does not assume a preliminary knowledge of the group proportions tt. 

Let us recall that X'*'^-'^) = {Xij,Xik,Xjk). The random vectors X^'"-'''^) form 
a set of non independent, but identically distributed vectors, with distribution 
of each X*^*'-'''') given by the mixture 

lP7r,Q,/3(X*^'^'^^) = ^ TTqTriTTmb {Xi2, Pqe) b {Xi3, Pqrn) b {X23, Plm) , 
l<q,l,in<Q 

where we recall that pqg — alq^f + filq^e- In this mixture, many components 
are in fact equal. Indeed, the components reduce to only four (when Q = 2) or 
five (when Q > 3) different distributions. More precisely, we may write 

P^,a,0(X^'''''^) = lib{X,2,a)b{X,s,a)b{X23, a) 

+ j2b{Xi2,mXi3,mX23,a) + 73^(^12, /3)6(Xi3, a)b(X23, /3) 

+ jib{Xi2,a)b(Xi3, /3)b{X23,l3) + 75^(^12, mXi3, mX23, (10) 

where the five proportions 7 = (71, . . . , 75) € appearing in this mixture are 
related to the original proportions tt, by the following relations 

'71 = E^^l""? = S3, 

< 7j = J2i<q^i<Q T^l-^i ^ S2 - S3, for j e {2, 3, 4}, (n) 
75 = J2i<q.e,m<Q nq-KeTTm = 1 - 3S2 + 2S3. 

I \{q,e,m}\=3 

Note that when (5 = 2, the fifth proportion 75 is automatically equal to zero. 
Moreover, as soon as Q < 3, the set of equations (11) defines a one-to-one 
relation between tt and 7. However, when Q > 3, the parameter tt is not 
uniquely defined from 7 and is not identifiable from the mixture distribution 
(10). 

We emphasize that the distribution (10) is a constrained 3-variate Bernoulli 
mixture. Parameters identifiability of such a distribution is further discussed be- 
low. However we shall already remark that while parameters of mixture models 
may in general be identified only up to a permutation on the node labels, the 
constrained form of the mixture (10) has the following consequence: the param- 
eters a and /3 will be exactly recovered as soon as the mixture components are 
identified from (10) and whatever the labelling of these mixture components. 
Indeed, among the five unordered components of the mixture, only three of 
them will be the product of two identical one-dimensional distributions, times a 
different one. The parameter /3 is then the parameter appearing in exactly two 
marginals in any of those three components. 
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Let us consider as our criterion a composite marginal log-likelihood of the 
observations 

/:™(7r,a,/3)= logP^,„,^(X(^^^-^-)). (12) 

(ij,fc)ei3 

We stress that this quantity is not derived from the marginal of the model 
complete data likelihood (expressed in (5)) and is simpler. It would be the log- 
likelihood of the observations if the triplets {^^^'■''''^}{ij^k)(£i3 were independent, 
which is obviously not the case. We now define our estimators as 

(7r„,d„,/3„) = argmax£5^"'P°(7r,Q;,^). (13) 

Note that according to the non uniqueness of group proportions tt corresponding 
to mixture proportions 7, the maximum with respect to tt in the above equation 
may not be unique. We also let 7„ be defined from 7r„ through (11). 

Using Theorem 1, the renormalized criterion (12) converges to a limit. The 
key point here is that under an identifiability assumption on the model pa- 
rameters, this limit is a function whose maximum is attained only at the true 
parameter value {■j,a,(3). Using classical results from M-estimators [van dcr 
Vaart, ] 9fl8, Wakl, 1949], we can then obtain consistency and asymptotic nor- 
mality of the estimators defined through (13). We thus need here to assume the 
identifiability of the model parameters. 

Assumption 1. The parameters 7, a,/3 of the model defined by (10) are iden- 
tifiable. In other words, if there exist (7r,a,/3) and (7r',a',/3') such that for any 
{x, y, z) £ {0, 1}"^ we have 

^■^,a,p{Xi2 X.Xii = y,X23 ^ z) ^ V-K'M', 13' {X12 = 2^, ^13 = y,X23 = 2), 

then (7, a,/3) — (7', a', /?'), where 7,7' are defined through (11) as functions of 
tTjTt', respectively. 

Let us now give some comments on this assumption. We first mention that 
identifiability of all the parameters (tt, a, (3) in the model defined by (1) and (4), 
i.e. relying on the full distribution over U„>i{0, l}^^) (comprising the marginal 
distributions of the random graphs over a set of n nodes, for any value of n), 
is a difficult issue, for which only partial results have been obtained in AUman 
ct al. [2011]. Surprisingly, the results under the affiliation assumption are more 
difficult to obtain than in the non affiliation case. The question here is slightly 
different and we ask whether a triplet distribution (10) is sufficient to identify 
only a and /3 (as well as the corresponding proportions 7). As already pointed 
out, the distribution (10) is a constrained distribution from the larger class of 
3-variate Bernoulli mixtures. In the case of (unconstrained) finite mixtures of 
multivariate (or 3-variate) Bernoulli distributions, while the models have been 
used for decades and were strongly believed to be identifiable [C'arreira-Pcrpihan 
and Rcnals, 2000], the rigorous corresponding result has been established only 
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very recently and using rather elaborate techniques [see Allman ct al., 2009, 
Corollary 5]. Unfortunately, this latter result does not apply directly here. While 
this might be hard to establish, we strongly believe that 7, a, /3 are identifiable 
from the distribution (10) and we advocate that from the simulations we per- 
formed, it seems a reasonable assumption to make. 

In the following, we also restrict our attention to compact parameter spaces, 
as this greatly simplifies the proofs and is not much restrictive. Generalizations 
could be done at the cost of technicalities [see for instance van dcr Vaart, 1998, 
Chapter 5]. 

Assumption 2. Assume that there exists some S > such that the parameter 
space is restricted to Ilg — {(tt, G 11; VI < 9 < Q- > 5, a G [S,l — S]. (3 £ 
[5A-6]}. 

We are then able to prove the following result. 

Theorem 4. In the model defined by (1) and (4), under Assumptions 1 and 
2, the estimators (7„,an,/3n) defined by (13) are consistent, as the sample size 
n grows to infinity. Moreover, the rate of this convergence is at least l/\/n and 
increases to 1/n in the particular case of equal group proportions (3). 

Let us now comment this result. We prove that the rate of convergence of 
our estimators is at least l/^/n. However, our simulations (see Section 6) seem 
to exhibit a faster rate, indicating that the limiting covariance matrix of the 
discrepancy \fn{^^ — 7, (3;„ — a, /3„ — /3) might be zero, even beyond the case 
of equal group proportions. Note also that when Q ^ 3, a consequence of the 
above result is that the estimator of the group proportions defined through 
7„ as the unique solution to the system of equations (11), is also consistent and 
converges with a rate at least 1/y/n. 

As it is always the case for mixture models, the (composite) log-likelihood 
(12) cannot be computed exactly (except for very small sample sizes). Approxi- 
mate computation of the estimators in (13) can be done using an em algorithm 
[Dempster ct al., 1977]. This procedure is presented in Section 5.1. It is known 
[Wu, 1983] that, under reasonable assumptions, the EM algorithm will give a 
solution converging to the estimators defined by (13), as the number of iterates 
grows to infinity. 

4. Weighted random graphs 

In this section, we focus on a particular instance of model (1) for weighted 
random graphs. The observations are random variables {AT^ }i<,j<j<„ that are 
either equal to 0, indicating the absence of an edge between nodes i and j, or a 
non-null real number, indicating the weight of the corresponding edge. We still 
assume that conditional on the latent structure {Zi}i<ci<n, the random variables 
{Xij}i<i<j<„ are independent, and the distribution of each Xij only depends 
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on Zi and Zj . We now further specify the model by assuming the following form 
for this distribution 

yqje {!,..., Q}, X,^\Z,gZj,^l^pgJ{;0ge) + {l-p,M-), (14) 

where {f{-,9),d e 8} is a parametric family of distributions, Sq is the Dirac 
measure at and Pqe G (0, 1] are sparsity parameters. We let p — {pqi} and 
6 — {Oqi}. The conditional distribution of Xij is thus a mixture of a Dirac 
distribution at zero accounting for non present edges, with proportion given by 
the sparsity parameter p (which can be 1 in case of a complete weighted graph) 
and a parametric distribution with density / that gives the weight of present 
edges. We focus on two different sparsity structures 

• either the sparsity is constant across the graph: pq^ P, VI < 9, ^ < Q; 

• or the sparsity parameters model an affiliation structure Pqg = alq=i + 
Plq^i and we assume a ^ /3. 

We moreover assume that we know the sparsity structure type. In any case, the 
connectivity parameter is assumed to take exactly two different values 



Vq,£e{l,...,Q},V = 




with 0i„ Bout- For identifiability reasons, we also constrain the parametric 
family {/(•, 9), 9 E Q} such that any distribution in this set admits a continuous 
cumulative distribution function (c.d.f.) at zero. Indeed, if this were not the case, 
it would not be possible to distinguish between a zero weight and an absent edge. 
Note that this model satisfies the affiliation assumption given by (2). Here, the 
complete data log-likelihood simply writes 

n Q 

Cx.z{'^,P,d) = l0gP7r,p,e({-'^ij }l<i<i<n,{-Z'i}l<i<„) = ^iq log TTq 

i=l q=l 

+ XI XI Z,qZjt\lx,^M'^0gf{Xij,9ql)+\0gPql) + lx,^=Q\0g{l-Pql)y 

(15) 

We now give some examples of parametric families {f{-,9),9 €E Q} that could 
be used as weights (or values) on the edges. 

Example 1. Let 9 — {fj,,a^) e M x (0, +oo) and consider f{-,9) the density of 
the Gaussian distribution with mean fi and variance . 

Example 2. Let 9 G (0, +oo) and consider f{-,9) the density (with respect to 
the counting measure) of the Poisson distribution, with parameter 9, truncated 
at zero. Namely, 

Vfc>l, /(A:,^^) = ^(e''-l)-^ 
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Note that in the above example, the Poisson distribution is truncated at zero 
because, as previously mentioned, it would not be possible to distinguish a zero- 
valued weight from an absent edge. Sparsity of the graph is modeled through 
the parameter p only and the density /(•, 9) concerns weights on present edges. 
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Binary affiliation model 



Weighted affiliation model 



Fig 1. Simulation of binary (left) and Gaussian weighted (right) graphs with two classes and 
20 nodes. In each case, the picture displays the graph representations with vertices colored 
according to classes, as well as the adjacency matrices where each entry Xij of the matrix is 
the binary /weight value of the edge between vertex i and vertex j. The rows and columns of 
these matrices are organized according to the classes. 



Figure 1 illustrates the difference between binary and weighted random graph 
affiliation models. For example the weighted graph of Figure f displays no binary 
affiliation structure: if the weights where truncated using the function x — t- Ix^o, 
we would not obtain that the two groups have different intra-group and inter- 
group connectivities. This means that classical community clustering algorithms 
would fail to find any meaningful structure on this type of graph. 

To our knowledge, this model has never been proposed in this form in the 
literature. In particular, the closest form is given in Mariadassou ct al. [2010] 
who do not introduce a possible Dirac mass at zero to enable sparsity of the 
graph. 

Let us now describe our estimation procedure based on M-estimators and a 
composite likelihood criterion. We proceed in two steps and first estimate the 
sparsity parameter, relying on an induced binary random graph. In a second 
step, we plug-in this estimator and focus on the connectivity parameters 9 by 
relying only on the present edges. 

Estimating the sparsity parameter. Let us first consider the case where 
Pqi — P- Then, we naturally estimate the sparsity parameter p by 

Pn = —, TV lx,j=o- 

n(n — 1 ^-^ ^ 

^ ' l<i<j<n 



The consistency, as well as well as the rate of convergence of this estimator 
follows from Theorem 1. 
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In the case where the sparsity parameter rather satisfies Pqi — alq=i + /31g^i, 
with a ^ /3, we rely on the underlying binary random graph (obtained by setting 
Yij = Ixij^o) and apply the results of Sections 3.1 or 3.2 to consistently estimate 
a and (3. 

Estimating the connectivity parameter 6. The present edges Xij (where 
i,j are such that Xij ^ 0) are non independent random variables, distributed 
according to a simple univariate mixture model X]<j£ ^<;^^P9^/('i ^g^)- clas- 
sical distributions f{-,0), it is possible to estimate the connectivity parameters 
0qi of this univariate mixture directly. In fact, we prove that as soon as the 
parameters {Oqi} are uniquely identified from the mixture '^qgT^qT^iPqef{',(^qe) 
and for regular parametric families {f{-,6), 9 € 9}, a consequence of Theorem 1 
is that maximizing a composite likelihood of the set of present edge variables 
provides a consistent estimator of the parameters. Let us introduce the needed 
assumptions. 

Assumption 3. The parameters of finite mixtures of the family of measures 
T = {/(s^*); Q G 6} are identifiable (up to label swapping). In other words, for 
any integer m> 1, 

m m m m 

i—l i—1 i—1 i—1 

Example 1, 2 (continued). Note that both the families of Gaussian and 
truncated Poisson densities satisfy Assumption 3. More generally, a wide range 
of parametric families of densities on M satisfy Assumption 3 [see Section 3.1 
in Titterington et ai, 1985, for more details]. 

The next assumption deals with regularity conditions on the model. Note 
that this assumption could be weakened using the concept of differentiability in 
quadratic mean [see for instance van dcr Vaart, 1998]. 

Assumption 4. The functions 6 i— > f{-,0) are twice continuously differentiable 
on Q. 

The last assumption is only technical and not very restrictive. It requires 
the parameter set to be compact and could be weakened at the cost of some 
technicalities. 

Assumption 5. Assume that there exists some S > and some compact subset 
Oc C O such that the parameter space is restricted to the set {(7r,p, 0);V1 < 
q<Q,Trq>S,p& [5,1-J],6» e ej. 

Now, each present edge variable Xij such that Xij is distributed accord- 
ing to the mixture '^i<:q £<:Q'^q'^ePq£f{'',dqe)- As there are only two different 
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components in this mixture, we express it in the more convenient form 
Q 

TTgTTfPgf |/(Xy ; 6'out) 

9=1 l<q^e<Q 

:= linfiX,f,Oin) + 7out/(^»j; ^out). (16) 

We consider a composite log-Ukelihood of present edges defined by 

/:™(7r,p,0)= Y log(7in/(X,,;0in)+7out/(X„;0out)). (17) 

l<i<j<n 

We stress that this quantity is not derived from the marginal of the model 
complete data likelihood (expressed in (15)) and is simpler. We now define 
estimators as 

dn = {din, dout} = argmax/:5^'°P°(7r, p„, 9), (18) 
e 

where p„ is a preliminary step estimator of p. Note that due to the label swap- 
ping issue on the hidden states, we estimate the set of values {9in, ^out} and can 
not distinguish 9in from 6'out- Section 5.2 further deals with this issue. 
We are now able to prove the following theorem. 

Theorem 5. In the model defined by (1) and (14), under Assumptions 3, 4 
and 5, the set of unordered M-estimators On = {^m, 9out\ defined by (18) is 
consistent, as the sample size n grows to infinity. Moreover, the rate of this 
convergence is at least l/\/n and increases to 1/n in the particular case of equal 
group proportions (3). 

The proof mainly relies on the consistency of the normalized criterion (17). 
This point is a direct consequence of Theorem 1. Then, from the criterion con- 
sistency, the identiflability and the regularity assumptions, one can derive the 
consistency of the corresponding M-estimator from classical theory [van der 
Vaart, 1998, Wald, 1949]. 

As already noted in the case of Theorem 4, our result establishes a rate 
of convergence equal at least to 1/y/n. The simulations (Section 6) seem to 
indicate that the rate may in fact be faster, a phenomenon that may be due to 
the degeneracy of the limiting variance of y/n{0n — &)■ 

As for M-estimators in the binary case, we shall approximate this maximum 
(composite) likelihood estimator using an em procedure [Dempster et al., 1977] 
whose convergence properties are well-established [Wu, 1983]. Contrarily the 
procedure presented in Section 3.2 where we need to adapt the EM framework 
to our specific model, we rely here on the classical em algorithm and thus do 
not recall it. 

5. Algorithms 

In this section, we provide tools to implement the procedures previously de- 
scribed, as well as a complement on the issue of recovering the latent structure 
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of a graph. 



5.1. EM algorithm with triplets 

Let us describe in this section the em algorithm developed to approximate the 
estimators defined by (13). 

In the following, each set of three nodes {i, j, k} corresponds to an index i 
ranging over the set {1, . . . , N}, where N — n{n — \){n — 2) is the total number 
of triplets. We let Xi = {X-'^,X-''^, X^-''^) be one of the observed triplets (namely 
each X-'^ for 1 < j < 3 corresponds to some former random variable X^t for 
some I < s,t < n) and Ui ^ A^(l,7) is the vector encoding the corresponding 
hidden state. Namely, JJ, G V5. We also denote by T\k the posterior probability 
of node triplet i being in state fc, conditional on the observation X-, namely 
Tifc = P(C/ifc — l|Xi), for 1 < A; < 5 and 1 < i < iV. Moreover, we encode the 
fact that, conditional on the 5 different hidden states of U , each coordinate of 
X is distributed according either to B{a) or yS(/3), using the following notation 

5jk = (<5]fe,'5|j,) = (lxi.3|(7ifc = l~B(a); lxi.J|;7ifc = l~B(;9))j 

for all 1 < j < 3, 1 < fc < 5 and any 1 < i < iV. Note that 5jk is deterministic 
and that 5jk G V2. With these notations, we are in the situation where we 
consider a composite Hkelihood (12) of random vectors {Xi}i<i<Ar from the 
mixture of 5 different 3-dimensional Bernoulli distributions, the latent classes 
being the random vectors {?/i}i<i<jv- 

The EM-algorithm is intended to iteratively compute and optimize, with re- 
spect to (7, a, (3) the function 

Q((7,a,/3);(7(^\a(^\/3(^))) =E^(.)^„(.,_^(,)[logP.,,„,^({[/Ji<,<jv,{Xi}i<,<Ar)|{^^^ 

using the current value of the parameter {■j^^\a^'^\ /3^'^^). If we let t^^^ = 
P-Y(=),a(=),0(=)(CAfe = we can write 



l<i<7V 



g((7, a, 13); {j(^\a^^\ (3^^^)) = ^ E l°g^^ + E E ^. 



N 5 N 5 

(^) 
ik 

i=l k=l i=l k=l 



3 



6],[X'-'^ \oga+{l-X'-^^)log{l^a)]+5%[x'-^ log I3+{1-X'-^) log(l-/?)}. 

(19) 

Starting from an initial value (7^^^ a*^^^ , /3*-^-'), the em algorithm proceeds in two 
iterative steps. At iteration s, the E-step computes the posterior distribution of 
Ui conditional on X-^-. Namely, 

(.)_p ... 7l-^n^^r^(^^'^^>(-)+^|,/3(-)) 

^ifc ^ (3) Q(a) fl(3) [Uik — i A-j — T-T — , 
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for every 1 < i < and every 1 < fc < 5. By using (19), we then get the value 
of Q((7, a, f3); (7*-''-', In the M-step, this quantity is maximized with 

respect to (7,a,/3) and the maximizer gives the next value of the parameter 
(^(s+i) ^(s+i)^^ This step relies on the following equations. 

^(.+1) = r^i\x'--' + + + r4^)xi.3 + ^ r^l^X'-^^) 

It should be noted that the sum over all the N possible triplets reduces in fact 
to a sum over 8 different possible patterns for the values of X-*-. Indeed, the 
posterior probabilities nk are constant across triplets with the same observed 
value. 



5.2. Unraveling the latent structure 

The general method we develop in this section aims at recovering the latent 
structure {Z^Y<i<n on the graph nodes. Indeed, the procedures developed in 
the previous sections only focus on estimating the parameters and do not directly 
provide an estimate for the node groups. 

We rely here on a simple method: we plug-in the estimators obtained from 
the previous sections in the complete data likelihood of the model (namely the 
likehhood of the observations {-'^ij}i<i<j<„ and the latent classes {^i}i<i<n)- 
As we do not have estimates of the mixture proportions tt, we simply remove 
this part from the expression of the complete data likelihood. Then, we simply 
maximize this criterion (which we call a classification likelihood) with respect to 
the latent structure {Zi}Ki<n- In a latter step, we then estimate the unknown 
proportions tt by the frequencies observed on the estimated groups Zi. 

Criterion in the binary case. In this setup, we introduce a criterion C, built 
on the complete data likelihood, where we plugged-in the estimators of a and 
P and removed the dependency on tt. This criterion simply writes 

Q 

C({ZJi<,<„) = J2 J2 Z,qZ,g{X,, loga + (1 - X,,) log(l - a)} 

l<i<j<n q—1 

+ E E ^.:,^,£{^ylog/3 + (l-X.,)log(l-/3)}. 

l<i<j<n l<q^£<Q 
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Criterion in the weighted case. Let us recall that the estimation procedure 
from Section 4 only recovers the set of unordered values {Oin, 6'out}- As we know 
these parameters up to permutation only, let {^i, 62} be any label choice for the 
corresponding estimators. We can consider two different criteria, denoted C^'^ 
and C^'^, as follows 

C"''"({Zi}i<i<„) — 

Z,gZji[lx,^^o(}og fiX.j-Ju) + ^ogpgi) + lx„=o log(l - Pqe)] 

l<i<j<n 
l<q^£<Q 

l<i<j<n 

i<q<Q 

where {u, v} = {1,2}. For each of these criteria, we can select the latent struc- 
ture Z"^" = {Zi, . . . , Z„)"^" maximizing it. Then, choosing the couple {u*, v*) 
maximizing the resulting quantity Q^^^ (^z^'^) seems to be an interesting strat- 
egy. We thus finally define our estimated latent structure {Zi, . . . , Z„) as . 

Iterative estimation of the latent structure. In any case (either binary or 
weighted), we propose to use an iterative procedure to compute the maximum 
Z of the criterion C{{Zi}i). Starting from an initial value Z*^^^ — {z[^\ . . . , Zn^) 
of the latent structure, we iterate the following steps. At step s, we (uniformly) 
choose a node io and select z|^^^^' as 

^zo^^'' = argmaxC({zf = q) 

l<q<Q 

(s-\-l) (s) 

while we let Zj — for j ^ io- At each time step, we increase the clas- 
sification likelihood C({Zj}i<j<„) and thus the procedure eventually converges 

to a (local) maxima. By using different initial values Z^^'' = {z[^\ . . . , Zn'^), we 
should finally find the global maxima. Once we estimated the latent groups Zi, 
we may obtain an estimate of the group proportions tt by simply taking the cor- 
responding frequencies. The procedure is summarized in Function latent . structure. 



Function latent . structure (^rap/i, parameters) 
input : observed graph, parameter values 
output: latent structure and group proportions 

Start from latent structure {Zi} 
while convergence is not attained do 
Choose node io 
_ Replace Zi„ with argmax^C({Zji^i„, Zi„ = q) 

Compute group proportions tt from {Zi} 
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5.3. Complete algorithm description 

Algorithm 2 describes the procedures for analyzing binary or weighted random 
graphs. We introduce a variable 'method' which can take three different values: 
'moments' or 'tripletEM' in the binary case and 'weighted' for weighted random 
graphs. In the weighted case, we moreover use a second variable called 'sparsity' 
to indicate whether we estimate a global sparsity parameter p ('sparsity=globar) 
or two parameters a,/3 from an affiliation structure ('sparsity=affiliation'). The 
performances of the procedures proposed in the current section are illustrated 
in the following one. 



Algorithm 2: Complete algorithm 

if method= 'moments ' then 

Compute rhi,i ~ 1,2,3 from (9) 
// INITIALIZATION 

Start from latent structure {Zi} with proportions tt and compute S2, S3 
while convergence is not attained do 

// UPDATE PARAMETERS 

if ahs{m2 — rh\) < e then 
[_ Compute a, /? through (8) 

else 

[_ Compute a,/3 through (7) 
// UPDATE LATENT STRUCTURE 
|_ Apply latent . structure to the current parameter values 

if method= 'tripletEM' then 

Estimate a, /3 from em algorithm with triplets (Section 5.1) 
Apply latent . structure to the parameter values 

if method— 'weighted' then 
// SPARSITY PARAMETERS 

Transform weights Xij into binary variables Yij — Ixi -^o 
if sparsity = 'global' then 

L Compute p = Y.r<j 

else 

|_ Estimate a,/? from EM algorithm with triplets (Section 5.1) 
// CONNECTIVITY PARAMETERS 

Estimate \d\a,BovLt \ from em algorithm with present edges (Section 4) 
// LATENT STRUCTURE 

Apply latent . structure to the parameter values 



C. Ambroise and C. Matias/Estimation in random graph mixtures 



21 



Table 1 

From left to right: low inter-group connectivity and strong intra-group connectivity (model 

1), strong inter-group connectivity and low intra-group connectivity (model 2), model 
without structure close to Erdds-Renyi random graph model (model 3). The picture displays 

an example with Q = 2 groups. 

Model 1 2 3 



a = 0.3 a = 0.03 a = 0.55 

/? = 0.03 13 = 0.3 13 = 0.45 




6. Numerical experiments 

We carried out a siniulation study to examine the bias and variance of the 
proposed estimators. In the binary afShation model, we also compared the per- 
formance of our proposal with the variational em (vem) strategy proposed by 
Daudiu ct al. [2008]. Note that Gibbs sampling has already been compared to 
VEM strategies in Zanghi et al. [2010] and give very similar results. Note also 
that the weighted affiliation model proposed here is original and we thus cannot 
compare our results in this case with any other existing implemented method. 

6.1. Binary affiliation model 

Simulations set-up. In these experiments, we assumed that edges are dis- 
tributed according to the binary affiliation model described in Section 3. The 
data were generated in different settings, with the number of groups Q £ {2, 5}, 
the number of vertices n e {20, 50, 100, 500, 1000}. For each of these cases, we 
created three settings corresponding to models with different ratios of intra and 
inter-group connectivity parameters (see Table 1). Moreover, we considered two 
different cases: equal or free group proportions tt. 

In each of these settings, we applied three different methods for estimating 
the model parameters: the moment method (corresponding to Section 3.1), the 
triplet EM method (corresponding to Section 3.2) and the variational em strategy 
(vem) proposed by Daudin et al. [2008], that we adapted to constrain it to an 
affiliation structure. The results for equal or free group proportions were similar 
and we thus present only the equal group proportions case. 

Figure 2 shows the estimated density (over 100 graphs simulations) of the 
estimators a and /3 for the three algorithms and the three models for graphs 
with 500 vertices. We see that for a given model the three methods produce 
estimators with similar densities. In particular, the estimators of a and /3 seem 
to have little or no bias and the variances are of the same order of magnitude for 
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the three estimation methods. As the behaviour of the estimators of a and /? are 
comparable over aU the simulations, we focus the discussion on the estimation 
of the parameter a. 





Figure 3 (top) displays the estimations of a averaged over 100 graph simu- 
lations as a function of the number of graph vertices in log-scale. For all three 
models, we see that all algorithms do produce unbiased estimation when the 
number of vertices is large enough. In addition to the asymptotically unbiased 
estimation, we observe an agreement in the sign of the bias among all algo- 
rithms, when the graphs are small. For example, when estimating a in model 1 
where (a — 0.3, /? — 0.03), all methods under-estimate a and over-estimate /3. 

In order to compare the dispersion of all estimators, we consider their em- 
pirical standard deviation computed over 100 simulations. Figure 3 (bottom) 
shows the evolution of the log of the empirical standard deviation of a when the 
size of the graphs grows from 20 vertices up to 1000 vertices. We see a linear 
dependence between the log of the graph size and the log standard deviation. 
The slope of the lines is about —1 which indicates that the standard deviation 
decreases with rate of the order 1 /n (where n is the number of vertices of the 
graph). The differences between the intercepts relate to constant factors driv- 
ing the relations between all rates of convergence. When Q = 2, we observe 
very similar intercepts for all methods, both for models 1 and 2. When Q = 5, 
VEM appears to converge faster but the order of the standard deviations remain 
comparable among all estimation methods. For model 3, the moment based 
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logio n 

Fig 3. a (top) and the log of its empirical standard deviation (bottom) averaged over 100 
graph simulations, as functions of the number of graph vertices in log-scale for equal group 
proportions. 



estimations have greater dispersion, but still decrease with the same rate. 
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Fig 4. Evaluation of Rand Index to compare true and estimated latent structure of binary 
affiliation graph. 

We use the adjusted Rand Index [Hubert and Arabic, 1985] to evaluate the 
agreement between the estimated and the true latent structure. The computa- 
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tion of the Rand Index is based on a ratio between the number of node pairs 
belonging to the same and to different classes when considering the actual la- 
tent structure versus the estimated one. This index lies between and 1, two 
identical latent structures having an adjusted Rand Index equal to 1. Figure 4 
displays the Rand Index for the three models and five different graph sizes. 
It appears that the three algorithms allow a reasonable recovery of the latent 
structure, for models 1 and 2, when the considered graphs have more than 100 
vertices. As expected, the larger the number of nodes, the better the recovery of 
the latent structure we observe. We also notice that our proposed strategy for 
recovering the latent structure performs as well or better than the variational 
approach in all cases. 

The previous experiments show that the two estimation procedures proposed 
in this work behave as well or better than the variational based algorithm, both 
for the parameter estimation and the recovery of the latent structure. Notice also 
that the moment based method does not depend on any sort of initialization, 
since it relics on the analytical resolution of a simple system based on triads 
(order 3 structures). 

6.2. Weighted affiliation model 

Simulations set-up. In the following experiments, we use a sparsity param- 
eter constant across the graph and non missing edges are distributed according 
to a Gaussian model as described in Section 4, with different means /lin and 
/Zout and equal variance a^. The intricacy of a model is inversely related to the 
'distance' between the parameters 0in and ^out- We use the Mahalanobis dis- 
tance A = K^in — Mout)/^!- Three models are considered with different levels of 
intricacy: we fix = 2 and /iout — Ij thus A = — /iout)/f I = 1/c which 
takes the values A = 10 (model A), A = 2 (model B) and A = 1 (model C). We 
fix the number of groups Q — 2, equal group proportions and consider different 
number of vertices n € {20, 100, 500, 1000}. 

We computed bias and empirical standard deviations over 100 simulations. As 
illustrated by Figure 5(a) in the case of /tin, the method recovers the parameters 
with no bias, except for model C where a small bias occurs due to the high level 
of intricacy of the model. Figure 5(b) displays the evolution of the log of the 
empirical standard deviation of /tin when the size of the graphs grows from 
20 vertices up to 1000 vertices. As for the binary affiliation model estimators, 
we observe a linear dependence between the log of the graph size and the log 
standard deviation, the slope of the hues lying in [—1/2,-1]. 

Figure 5(b) displays the Rand Index for the three different models (A,B,C) 
and four different graph sizes. When graphs have more than 100 nodes, recovery 
of the hidden structure is almost perfect in all situations as previously observed 
in the binary case. 

The previous experiments show that when dealing with weigthed affiliation 
graphs, the estimation of the parameters and of the graph latent structure can 
be efficiently achieved considering only edges (order 2 structures). 
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Fig 5. Evaluation of the estimation of the parameters of a weighted graph, (a) fii„ and (b) 
log of its empirical standard deviation, both as functions of the number of graph vertices, 
expressed in log-scale. Each estimation is averaged over 100 graph simulations, (c) Rand 
Index computation comparing the true latent structure to the estimated one for the three 
models (columns A, B, C) and four graph sizes (rows n = 20, 100, 500, lOOOj. 



6.3. Cross-citations of economics journals 

Let us illustrate the difference between weighted and binary models for graph 
clustering using a real data example. We consider cross- citations of 42 economics 
journals over the years 1995-1997 [Pictcrs and Baimigartncr, 2002]. The raw data 
corresponds to a weighted non symmetric graph where vertices are journals and 
directed edges the number of citations from one journal to another one. We first 
take the mean value of citations between each pair of journals (leading to a sym- 
metric adjacency matrix) and work with its normalized Laplacian. Figure 6 dis- 
plays the affiliation matrices structured according to a partition in four classes. 
Clustering based on the binary model and on the weighted model (respectively 
left and right sides of Figure 6) exhibit very different cluster structures. The 
binary model finds classes which tend to be homogeneous in terms of probabil- 
ity of intra-group and inter-group connections, while the weighted model finds 
classes which are homogeneous in terms of intra-group and inter-group connec- 
tion weights. This distinction results in completely different interpretations. 

The binary model finds two groups of nodes which are strongly connected 
within their groups but also with nodes from the other groups. It also exhibit two 
other smaller classes with low intra-group connectivity and nodes that preferen- 
tially link to the first class which plays the role of a reference class. Indeed the 
first class (top left) found by the binary model is composed by journals with high 
impact factors: American Economic Review (AER), Econometrica (E), Journal 
of Economic Literature (JEL), Journal of Economic Perspectives (JEP), Journal 
of Political Economy ( JPE) , Quarterly Journal of Economics (Q JE) , Review of 
Economic Studies (RES) and Review of Economics and Statistics (RES2). 

The result produced by the weighted model shows a main class of strongly 
interconnected journals and three smaller classes of journals, which weakly cross- 
cite each other: 
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Fig 6. Matrices of cross-citations between J^2 economics journals with rows and columns 
reorganized according to groups found by the binary random graph mixture model (a) compared 
to groups found with the weighted random graph mixture model (b). 



Class 1 (health) Health Economics (HE), Journal of Health Economics (JHE); 
Class 2 (natural resources) Journal of Agricultural Economics (AJAE), Land 

Economics (LAE), Journal of Environmental Economics and Management 

(JEEM); 

Class 3 (economic history) Exploration of Economic History (EEH), Journal 
of Economic History (JEH), Economic History Review (EHR). 

Each of these three classes is composed by journals dedicated to similar topics 
(respectively, health, natural resources and economic history). They preferen- 
tially cite journals from the first class which contains journals with less specific 
topics. 

7. Proofs 

Proof of Theorem 1. In order to facilitate the reading of the proof, we decom- 
pose it into several stages. 

Preliminaries. We fix fc, s > 1 and p — (2) . Let us recall that Vq is the set 
of Q-size vectors such that for any v = (ui, . . . , Vq) £ Vq, we have Vi G {0, 1} 
and Y^f^i Vi = 1. We also let Q = {1, . . . , Q}. We then consider the set 

Z . {z . V, = (,„...,,.) E Q^ ^^n,_ := ^ f[ z,, ^^^^ f[ 

ieifc 1=1 1=1 

Moreover, we let Nq — J^iex^ O/Li ^iiqr The strong law of large numbers gives 
the almost sure convergence, as n tends to infinity, of [{n~k)\/nl]Nq to J^fLj^ tt^j. 
This implies that P({2'„}„>i e Z) = 1. 
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Consistency of rhg. We first introduce tlie conditional mean of g{^-) given 
that tlie Iridden groups at position i are given by q 

k 

mg(g)=E(g(Xi)|n^.,,, =1 
1=1 

Using the equalities 

k k 

VieXfc, ^ =1 and ^ ([]^,,)TO<,(q), (20) 

qeQtl=l qeQk 1 = 1 

we may write the decomposition 

fc fe 



(n-ky. 

'^g~'^g = , 



qeQ''ieXki=i qeQ'' i=i 

qeQfc ■ iGlfc 1=1 ■ 1=1 



(21) 



In order to establish the consistency of rhg, we rely on a conditioning argu- 
ment. Let A be the event limsup„_>.o^ \mg — mg\ — 0. We then have 

P(A)=E[E(U|{Z„}„>i)]. (22) 

Now, conditional on {Zn}n>i = z, the random variables {X-*-; i e 1^, Jl/=i ^iiqi — 
1} form a n^-sample of independent and identically distributed random vari- 
ables. Letting B be the event 

^ E {9m~mg{q))\^Q, 

the strong law of large numbers yields that for any z e Z, 

E(ls|{^„}„>i = z) = 1. 

Conditional on {Z„}„>i = z G Z, we may thus re- write the decomposition 
(21) as 

-{n-k)\ 1 



limsup 

n—^oc -^^ q 



{n-ky. 



.(2)(^"i-n 



9! 

1=1 



which establishes that for any z G Z, we have E(lyi|{Z„}„>i = z) = 1. Coming 
back to (22), we thus obtain 



P( lim m„ = Too) = 1. 

n— voo 
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Asymptotic normality of rhg. Let us now prove a central limit result for 
y/n{rhg — nig). First, the central limit theorem applied to the Q-size vector 
X]r=i(-^« — -K^l^fn gives the following convergence 

1 " 

-= ^(^i - tt) --^ 7V(0, E), as n cx), (23) 
"^^^ 1=1 

where Sg, = 7rq(l — tt^) and S^f = —i^qKg when (.. 

Now, let us consider the second term appearing in the right hand side of (21). 
To establish a central limit theorem for Nq, we decompose the sum of products 



fe 

iGlk 1 = 1 ielfc 1=1 



into sums of products of centered terms {Zi^q^ — vTgJ. This leads to 

1=1 u=l L<l{l,...,k}-\L\=u l^L ieXi, /GL 

where \L\ denotes the cardinality of the set L and II denotes the set of injective 
maps from L to Z = {1, . . . , n}. In this expression, the leading term (obtained 
for singleton sets L, i.e. when u — V) gives the rate of convergence in the central 
limit theorem. In other words, 



\ n\ - 



/c \ fc 1 ^ 

+i: ^^^^(n'.)i:n(^"«-'«)- i^") 

«=2 Lc{l,...,fc};|-L|=« i^L iEIl leL 

The first term in the right hand side of (24) converges to a linear combination 
of the coordinates of a Af{0, S) vector, whereas the second term converges to 
zero. Indeed, for any value u > 2 and any set L of cardinality u, we may write 

I E IT(^*"?' ^ ^7 ? E T\(^nqi -^qi) 

which converges to zero. Thus we get. 



'n 



E -.(^)[^^.-ri-J - E ™.(^)[E%r'E(^..-. 

qeC" ■ fc=i 9eQ'= i=i ^ j=i 



/)+i?„. 



where Rn,q — op(l) are negligible terms converging in probability to zero, as n 
tends to infinity. According to (23), we obtain that 

fe fc 

In 
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where W = {Wi, Wq) ~ 7V(0, E). 

To obtain a central limit theorem for rhg it now suffices to prove that the 
first term in the right hand side of (21) is negligible, when scaled by the rate of 
convergence ^/n. Indeed, we may write this term as 

which satisfies, for any fc > 2, any e > and any z € Z, 

P(|i?n| > e|{^n}„>l - Z) ^„^^ 0. 

Using dominated convergence, we also have P(|^„| > e) — >„^oo 0, for any e > 0. 
Now, going back to (21), we finally obtain 

k 

°° qeQ'' 1=1 J^l 

Expression for the limiting variance Sg. The computation of the variance 
Eg could be done using the above expression, but this leads to tedious formulas. 
A simpler expression of the limiting variance is obtained in the following way. 
We prove that y^Un '■= y/n(rhg — rUg) has a bounded third order moment. This 
is sufficient to claim that Eg can be obtained as the limiting variance of \/nC/„. 

First, since non adjacent edges form independent variates, it is easy to see 
that we have 

mV^Uj')<( t^'l^lX E E(||g(Xi)-mg|l||.9(Xi)-mg|l|l5(Xi)-mg|l), 
^ '''' ij,{;inin{5^0 

where iPli stands for the intersection of i and i viewed as index sets (instead of k- 
tuples). The above sum contains at most fc^n[(n — 1) • • • (n— fc + 1)]'^ terms, which 
are bounded (there are finitely many of them). Thus this quantity converges to 
zero as n tends to infinity. Moreover, 

Var(ynC/„)^f-i|-i)' Cov(g(Xi), ^(Xi)). 



i,i;ini^0 



The above sum may be decomposed according to the cardinality of the set i fl i. 
It is then easy to see that the dominating term is obtained when |i n i| = 1, 
while the other terms converge to zero. Namely 

Var(V^L/„)= ( E Cov(.9(Xi),.g(Xi)) + o(l). 

To describe all the possible configurations where |i H i| — 1, we may fix the 
first index i to (1, . . . , /c) and let the second index \ describe the set of indexes 
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where some position s takes one of the values {I, . . . ,k} (corresponding to the 
intersection ini) and at any other position, there is some value in {fc + 1, . . . , n}. 
For any s,t e {!,..., fc}, we thus let e* £ Ik satisfying e' (s) = t and e* (j) G 
{A: + 1, . . . , n} for any j ^ s. With this notation, we obtain 

k k 

= lim Var(V^[/„) = V VCov(g(X(i'-''=)),5(X^*))- 

s=l t=l 

Note that in the case of an affiliation structure with equal group proportions, we 
could prove from this expression that Eg = (using for instance the results of 
Lemma 1 presented below). Anyway this will be a consequence of the following 
developments. 

The degenerate case. Let us now finish this proof by considering the specific 
case where we have an affiliation structure (2) and equal group proportions (3). 
Coming back to (21), we write rhg — nig = Ti + T2 where 

qeQ'' 1 = 1 

We first deal with the second term T^- According to (24), we have 

k 11^ 

qfzQk 1=1 ^ i=l 

+ E E ^QiUEn(^^^.-.)-^-+^- 

ii=2 Lc{i,...,fc};|L|=M ^ ieiL iei 

We now prove that the first term in the right hand side of this equality, namely 
T2_i is zero. This result relies on the following lemma, stating that the model is 
invariant under a permutation of the values of the node groups. 

Lemma 1. Under the assumptions and notations of Theorem 1, assuming more- 
over (2) and (3), for any a G Sq the set of permutations of Q, we have 

{{Zi\i<i<m {Xi.j}i<,i^j<n) =({o'(.^i)}i<i<n, }i<j<j<„) , 

where —'^ means equality in distribution. As a consequence, for any value q G Q'^, 
the conditional expectation mg{q) is constant along the orbit (induced by Sq) of 
the point q, i.e. the set of values {mg{a{q)); a € Sq} is a singleton for any fixed 
qeQ''. ' 
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Indeed, according to (1), (2) and (3), and using that any permutation ct is a 
one-to-one application, we have 

n 

l<^<n 7 l<i<ji'<n 

i— 1 ^^'i<j<n 

^ l<i<j<n 

As a consequence, for any a e Sq and any value q e Q'^, the conditional 
expectation mg{a{q)) satisfies 

m<,(a(g)) =E(.g(X(i--"))|(Zi,...,Zfe) = 

= E(g(X(i-^'=))|(a(Zi), . . . , a{Zk)) = = m,(g). 

Thus the set of values {mg{cr{q));<j E Sq} is reduced to a singleton. This fin- 
ishes the proof of the lemma. 

Now, going back to the term T'2,1, the set Q*^ may be partitioned into the 
disjoint union of the orbits induced by Sq, namely Q'^ = ^OorbitO, with q — >■ 
mg{q) being constant on each orbit O. We let mg^o denote the value of the 
function q — > mg{q) on the orbit O. Then we write 



^ k n 



Oorbit 1=1 i=l qeO 

For each orbit O and any position / e {1, . . . , k}, if we fix some q £ then we 
argue that O contains all the points of the form {qi, . . . , j, gj+i, . . . ,qk) for 
any 1 < j < Q. Indeed, all these points are images of q by the simple transposi- 
tions {qi j). Thus, the sum Y.qeo'^^iii ~'^qi) contains fhe sum Y^qieQ^^^m 
which is zero. This proves that T2^i = and thus 

nimg-rug) = niT^+T,,,) = (II^AtV^^ ^ (g(Xi)-m,(g)) 

ni = l Zi^qi 

+ CW^ 7;^^) {Zrq-^q){Zjt-T^l)+0{l), 

q,ieQ,q^e ^ ' l<i/j<n 

where, as in the non degenerate case, we argued that the terms in r2,2 involving 
sets L with cardinality u > 3 are negligible. We then obtain that for A: = 2, we 
have 

n{rng - rUg) -^,,^00 7^ 51 + Y i^i^i + n^)' 

^ q,e<£Q q,eeQ^q^i ^ 
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where for any 1 < i < the random variables Vq^ are independent, with dis- 
tribution J\f{0,Ya.Tlg{Xi2)\ZiqZ2i = 1)) and = (T^i, . . . , Wq) is independent 
from the Vqis, with distribution A/'q(0, S), and in the equal group proportions 
case E simplifies to S^^ = when q ^ £ and Egg = {Q — 1)/Q^. 

In the same way, whenever fc > 3, all the terms appearing in Ti are now 
negligible and we get 



n{rhg ~ mg) ^ {WqWe + -^). 




□ 

Proof of Theorem 3. Following the proof of Theorem 1, we can easily write a 
joint central limit theorem for the triplet (7711,7712,7713). Namely, 



with some covariance matrix V. Thus, we can apply a delta-method [see for 
instance van dor Vaart, 1998, Chapter 3] to the estimators $ = 0(7711,7712,7713) 
and a = '(/'(ttii, 7712, 7773) where the functions (f> and ip are differentiable. This 
gives the convergence of the estimators (a, /3) and guarantees the same rates of 
convergence for a,/3 than for the Th^'s. □ 

Proof of Theorem 4- Following the classical proof of Wald [1949] [see also van der 
Vaart, 1998], we may obtain the almost sure convergence of (7„,q;„,^„) to the 
true value of the parameter {j* , a* , f3*) , provided the parameter space is com- 
pact and the three following assumptions are satisfied: 

i) Convergence of the criterion 

^„(7r,a,/3) — — V \ogF^^a.i3{Xij,Xik,Xjk) 

71 7i — 1 (77 — 2) ^-^ 

(X12, Xi3, X23), 

-almost surely ; 
a) Identification of the parameter (7, a, j3) 

H{{7v, a, /?); (tt*, a\ (3*)) < H{{n\ a\ 13*); (tt*, a^ ^ )), 

with equality if and only if (7,0,/?) = (7* , a* , /3* ) , where 7 and tt are 
related through (10); 
in) Uniform equicontinuity of the family of functions (tt, a, /3) ^n(7r, a, /?). 
Namely, for any e > 0, there exists some > such that for all 71 > 1 
and as soon as \\{TT,a,f3) — (tt', a', /3')||oo < J^, we have |€„(7r, a, /3) — 
4(7r',a',/3')l <e. 
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Item i) follows from Theorem 1, while ii) follows from Jensen's inequality and 
identifiability of the parameters, i.e. Assumption 1. Let us know establish in). 
We fix for the moment some v > and consider rj — (7r,a,/3) and 77' = 
(7r',a',^') such that H??- r;'||oo < i^- We recaU that {X,j, X,k, Xjk) = X^'-^-'''>. 
We then write 

\\og¥„{Z,gZ,eZk^ = 1,X(^'^"''=)) -logP^-(Z,gZ,,Zfe„ - 1, X(^^^-'=))| 

< I logTTq - logTT^I + I logTT^. - logTT^I + | log 7r„ - log TT^, | 

+ I logP,(X(''^''=)|Z,,Z,,Zfc™ = 1) - logP^,(X(''^''=)|Z,,Z,,Zfe„ = 1)|. 

The second term in the right hand side of this inequality may be bounded as 
follows 

|logP,(X(^'^'^'=)|Z,,Z,,Zfe™ - l)-\og¥^,{X^''^-'^\Z,gZ,eZkm = 1)1 

< 3max(|loga - loga'|, | log(l - a) - log(l - «')!, | log^ - log/3'|, 

|log(l-/3)-log(l-/3')|). 

We now make use of the fact that we restricted our attention to the parameter 
space n^, in which all the parameters are lower bounded by d (Assumption 2). 
Moreover, for any x,y > 0, we may use jloga; — logy| < \x — y\/ imii{x,y). This 
finally leads to 

|logP^(Z,,Z,,Zfc,„ - 1,X(*'^'^'=)) -logP^-(Z,,Z,,Zfe„ = 1,X(*'J''^))| < 6S-'iy. 
Now, we obtain 

P,(X(''J"^'=)) =Y,PviZ:qZ,fZk,n = 1,X(^^^'''=)) 

< exp(6riz.)^P^.(Z,,Z,,Zfc„ = 1,X(''-'"^")) = exp(6(5-V)P^,(X('^^"'*^)), 

q£m 

and thus 

logP^(X(''J''=') < y +logP^-(X('J''=)). 
As this inequality is symmetric with respect to rj and ry', we further obtain 

|logP,(X(*-^'^'=))-logP„KX(''^'^'^))| < y. 

Finally, 

\in{v)-enW)\ <-, ^7 -^|l0gP,(X(''^''=))-l0gP„(X(''^^''^-))| < ^, 

nin — l)in — 2) ■^-^ 
which establishes in). 
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To further obtain the rates of convergence of the estimators, one usuaUy 
proceeds to a Taylor expansion of the derivative d£n{T^*, a*, f3*) in a vicinity of 
the estimator (-TTn, a„, /3„). Let us write 

where (7r„, a„, /?„) is some point between (7r„, Q!„, /3„) and (tt*, a*, /?*). Ap- 
plying Theorem 1 to the quantity 9^„(7r*, a*, we obtain its almost sure 
convergence to E7r*,Q*,^* (91og P7r*,a*,^* (^12, ^13, -'^23)) = 0, as well as the 
asymptotic normality 

y/ndin{n\a\P*) ->„^oo AA(0, J). 

Now, at a fixed point (tt, a, l3), the Hessian matrix 3^i?„(7r, a, (3) converges from 
Theorem 1 to E7r*,Q«,^* (9^ logP7r,a,^(-'^i2, -^13, -^23))- Combining the almost 
sure convergence of (7r„, d„, /3„) to (tt*, a*, /?*), with uniform equicontinuity 
of the family of functions (tt, a, /?) — > d'^£n{7r, a, f3) (the proof is similar to point 
in) above and is therefore omitted), we obtain the almost sure convergence 

9^4i(7r„,a„,/3„) -^n^oo K„\a\p* {d'^ ^ogF„* ^p* {X12, X13, X23)) -K- 

If the Fisher information matrix K is invertible, we obtain 

y^[(7^„, d„, /3„) - (tt*, a*, 13*)] AA(0, R-^JR-'). 

In this case, the inverse of the limiting variance is known as Godambe informa- 
tion [Varin, 2008]. Its form is due to the fact that 7^ J in general, resulting 
in a loss of efficiency of the estimators. In case where K is not invertible, or 
when J = 0, the rate of convergence of the estimators is faster than 1/^/n. In 
particular, when the group proportions are equal, we know from Theorem 1 that 
n9-^„(7r*, a*, /?*) converges in distribution and then the rate of convergence of 
(■7r„, q;„, /?„) is at least 1/n. □ 

Proof of Theorem 5. The proof follows the scheme described in the proof of 
Theorem 4. We denote by (tt*, p*, 6*) the true value of the parameter and by 
P* and E* the corresponding probability and expectation. First, let us establish 
the consistency of the normalized composite likelihood (point i)). According to 
Theorem 1, we have for any fixed value of (tt, p, 6), 

— ^ 5^ lx.,^ologP.,p,.(X,,)^^^E^(lx,,^ologP.,p,e(Xi2)) 

:=H{{n,p,e);{7T*,p\e*)), P* a.s. 

Here, we need to deal with the fact that we use a random value for p (a pre- 
liminary step estimate) in the definition of ^. It is thus necessary to prove that 
this convergence happens uniformly with respect to p. But this is going to be 
a consequence of point in) below. Combining this with the almost sure conver- 
gence of p„ to the true value p* (this is either a consequence of Theorem 1 when 
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p — p is constant, or a consequence of Sections 3.1 and 3.2 when p — (a,/3)), 
we get 

— ^/:™(7r,p„,0) ~-> H{{7T,p,e);{7T*,p\e*)), a.s. 

n[n — I) ri-i.+oo 

Moreover, we assumed that f{-,9) has a continuous c.d.f. and the distribution 
of a present edge is given by (16), so that we have 

HUtv, p, 9); (tt*, p*, 0*)) = / log(7i„/(x; 9-,^) + 7out/(x; 0out)) 

J X 

where (7in,7out) as well as (7i*i,7out) ^'^^ defined through (7r,p) and (7r*,p*) 
respectively. Thus, the difference 

H{{tv\p\ r); (tt*, p*, r)) - i/((7r, p, 0); {tv\p\6*)) 

is a KuUback-Leibler divergence between two mixture distributions of the form 
(16). This entails positivity of this difference. Moreover, Assumption 3 ensures 
that the difference is zero if and only if 

linSe,^ + 7out<5e„„, = 7in'5e*„ + llnt^e^^^^, 

which establishes point zi), up to a permutation on the label parameters {in, out}. 
Finally, the proof of point in) follows the same lines as in the proof of Theo- 
rem 4, and uses the continuity of the map 9 f->- f{-,0), which is a consequence 
of Assumption 4. 

To further obtain the rates of convergence of our estimators, we proceed 
exactly as we did in the proof of Theorem 4. □ 

Acknowledgments The authors thank Jerome Dedecker for helpful insights 
concerning this work as well as the associate editor and an anonymous referee 
for their remarks leading to considerable improvements on the manuscript. The 
authors have been supported by the French Agence Nationale de la Recherche 
under grant NeMo ANR-08-BLAN-0304-01. 

References 

Airoldi, E., D. Blei, S. Fienberg, and E. Xing (2008). Mixed-membership 
stochastic blockmodels. Journal of Machine Learning Research 9, 1981-2014. 

Allman, E., C. Matias, and J. Rhodes (2009). Identifiability of parameters in 
latent structure models with many observed variables. Ann. Statist. 57(6A), 
3099-3132. 

Allman, E., C. Matias, and J. Rhodes (2011). Parameters identifiability in 
random graph mixture models. Journal of Statistical Planning and Inference. 
To appear, doi:10.1016/j.jspi.2010.11.022. 



C. Ambroise and C. Matias/Estimation in random graph mixtures 



36 



Barrat, A., M. Barthelemy, R. Pastor-Satorras, and A. Vespignani (2004). The 
architecture of complex weighted networks. PNAS 101(11), 3747 3752. 

Boccaletti, S., V. Latora, Y. Moreno, M. Chavez, and D. Hwang (2006). Com- 
plex networks: Structure and dynamics. Physics Reports 424(4^-5), 175-308. 

Carreira-Pcrpihan, M. A. and S. Renals (2000). Practical identifiability of finite 
mixtures of multivariate Bernoulli distributions. Neural Comp. 12{1), 141- 
152. 

Cox, D. R. and N. Reid (2004). A note on pseudolikelihood constructed from 

marginal densities. Biometrika 91 (3) , 729-737. 
Daudin, J.-J., F. Picard, and S. Robin (2008). A mixture model for random 

graphs. Stat. Comput. 18(2), 173-183. 
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from 

incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 59(1), 

1-38. 

Doreian, P., V. Batagelj, and A. Ferligoj (2005). Generalized Blockmodeling. 

Cambridge University Press. 
Erdos, P. and A. Renyi (1959). On random graphs. I. Puhl. Math. Debrecen 6, 

290-297. 

Erosheva, E., S. Fienberg, and J. Lafferty (2004). Mixed-membership models of 

scientific publications. PNAS P7(22), 11885 -11892. 
Fortunato, S. (2010). Community detection in graphs. Physics Reports 4^6 {3- 
5), 75-174. 

Frank, O. and F. Harary (1982). Cluster inference by using transitivity indices 

in empirical graphs. J. Amer. Statist. Assoc. 77(380), 835-840. 
Goldenberg, A., A. X. Zheng, S. E. Fienberg, and E. M. Airoldi (2010). A survey 

of statistical network models. Found. Trends Mach. Learn. 2 (2), 129-233. 
Gunawardana, A. and W. Byrne (2005). Convergence theorems for generalized 

alternating minimization procedures. J. Mach. Learn. Res. 6, 2049-2073. 
Holland, P., K. Laskey, and S. Lcinhardt (1983). Stochastic blockmodels: some 

first steps. Social networks 5, 109-137. 
Hubert, L. and P. Arable (1985). Comparing partitions. J. Classification 2, 

193-218. 

Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and 
Models. Springer. 

Laloux, L., P. Cizeau, J. -P. Bouchaud, and M. Potters (1999). Noise dressing 
of financial correlation matrices. Phys. Rev. Lett. 83(7), 1467-1470. 

Latouche, P., E. Birmele, and C. Ambroise (2011a). Overlapping stochastic 
block models with application to the french political blogosphere. Ann. Appl. 
Stat.. To appear. 

Latouche, P., E. Birmele, and C. Ambroise (2011b). Variational bayesian infer- 
ence and complexity control for stochastic block models. Statistical Modelling. 
To appear. 

Mariadassou, M., S. Robin, and C. Vacher (2010). Uncovering latent structure 

in valued graphs: a variational approach. Ann. Appl. Stat. ^(2), 715-42. 
Newman, M. E. J. (2004). Analysis of weighted networks. Phys. Rev. E 70, 
056131. 



C. Ambroise and C. Matias/Estimation in random graph mixtures 



37 



Newman, M. E. J. and E. A. Leicht (2007). Mixture models and exploratory 

analysis in networks. PNAS 104{2:i), 9564 9569. 

Nowicki, K. and T. Snijders (2001). Estimation and prediction for stochastic 
blockstructures. J. Amer. Statist. Assoc. 96(455), 1077-1087. 

Picard, F., V. Miclc, J.-J. Daudin, L. Cottrct, and S. Robin (2009). Decipher- 
ing the connectivity structure of biological networks using MixNet. BMC 
Bioinformatics 10, 1-11. 

Pictcrs, F. and H. Baumgartncr (2002). Who talks to whom? intra- and inter- 
disciplinary communication of economics journals. Journal of Economic Lit- 
erature 40(2), 483-509. 

Snijders, T. A. B. and K. Nowicki (1997). Estimation and prediction for stochas- 
tic blockmodels for graphs with latent block structure. J. Classification 14 (1), 
75-100. 

Tittcrington, D., A. Smith, and U. Makov (1985). Statistical analysis of fi,nite 
mixture distributions. Wiley Series in Probability and Mathematical Statis- 
tics. 

van dcr Vaart, A. W. (1998). Asym,ptotic statistics, Volume 3 of Cambridge 
Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge 
University Press. 

Varin, C. (2008). On composite marginal likelihoods. AStA Adv. Stat. 
Anal. 92{l), 1-28. 

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. 

Ann. Math. Statistics 20, 595-601. 
Wu, C. (1983). On the convergence properties of the EM algorithm. Ann. 

Statist. 11(1), 95-103. 
Zanghi, H., C. Ambroise, and V. Miclc (2008). Fast online graph clustering via 

Erdos Renyi mixture. Pattern Recognition ^-?(12), 3592-3599. 
Zanghi, H., F. Picard, V. Miele, and C. Ambroise (2010). Strategies for online 

inference of model-based clustering in large and growing networks. Ann. Appl. 

Stat. 4(2), 687-714. 

Ziberna, A. (2007). Generalized blockmodeling of valued networks. Social Net- 
works 29(1), 105 - 126. 



